Notes:
lm(...). Do
not use stan_glm(...).lmameliazeligsummarymissmapplotInstall the packages called Amelia and
Zelig. You can do so by copying and pasting the below code
into your R Console. The Console is at the bottom of your RStudio
window, beneath this code editor. Press enter to run both lines of
code.
install.packages("Amelia")
install.packages("https://cran.r-project.org/src/contrib/Archive/Zelig/Zelig_4.2-1.tar.gz",
repos=NULL,
type="source")
Load the library Amelia. Also load
Zelig.
library("Amelia")
## Warning: package 'Amelia' was built under R version 4.4.3
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.3, built: 2024-11-07)
## ## Copyright (C) 2005-2025 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(codebook)
## Warning: package 'codebook' was built under R version 4.4.3
Load the dataset called freetrade.
data("freetrade")
codebook <- freetrade
codebook <- detect_missing(codebook,
only_labelled = TRUE, # only labelled values are autodetected as
# missing
negative_values_are_missing = FALSE, # negative values are missing values
ninety_nine_problems = TRUE, # 99/999 are missing values, if they
# are more than 5 MAD from the median
)
codebook(codebook)
Dataset name: codebook
The dataset has N=171 rows and 10 columns. 96 rows have no missing values on any column.
|
Distribution of values for year
0 missing values.
| name | data_type | n_missing | complete_rate | min | median | max | mean | sd | hist | label |
|---|---|---|---|---|---|---|---|---|---|---|
| year | numeric | 0 | 1 | 1981 | 1990 | 1999 | 1990 | 5.493311 | ▇▇▆▇▇ | NA |
Distribution of values for country
0 missing values.
| name | data_type | n_missing | complete_rate | n_unique | empty | min | max | whitespace | label |
|---|---|---|---|---|---|---|---|---|---|
| country | character | 0 | 1 | 9 | 0 | 5 | 11 | 0 | NA |
Distribution of values for tariff
58 missing values.
| name | data_type | n_missing | complete_rate | min | median | max | mean | sd | hist | label |
|---|---|---|---|---|---|---|---|---|---|---|
| tariff | numeric | 58 | 0.6608187 | 7.1 | 25 | 100 | 31.65398 | 21.22296 | ▇▅▁▂▁ | NA |
Distribution of values for polity
2 missing values.
| name | data_type | n_missing | complete_rate | min | median | max | mean | sd | hist | label |
|---|---|---|---|---|---|---|---|---|---|---|
| polity | numeric | 2 | 0.9883041 | -8 | 5 | 9 | 2.905325 | 5.547675 | ▅▂▁▆▇ | NA |
Distribution of values for pop
0 missing values.
| name | data_type | n_missing | complete_rate | min | median | max | mean | sd | hist | label |
|---|---|---|---|---|---|---|---|---|---|---|
| pop | numeric | 0 | 1 | 1.4e+07 | 5.3e+07 | 1e+09 | 149904501 | 254569660 | ▇▁▁▁▁ | NA |
Distribution of values for gdp.pc
0 missing values.
| name | data_type | n_missing | complete_rate | min | median | max | mean | sd | hist | label |
|---|---|---|---|---|---|---|---|---|---|---|
| gdp.pc | numeric | 0 | 1 | 150 | 814 | 12086 | 1867.28 | 2562.952 | ▇▂▁▁▁ | NA |
Distribution of values for intresmi
13 missing values.
| name | data_type | n_missing | complete_rate | min | median | max | mean | sd | hist | label |
|---|---|---|---|---|---|---|---|---|---|---|
| intresmi | numeric | 13 | 0.9239766 | 0.9 | 3.2 | 7.9 | 3.375218 | 1.572759 | ▆▇▅▂▁ | NA |
Distribution of values for signed
3 missing values.
| name | data_type | n_missing | complete_rate | min | median | max | mean | sd | hist | label |
|---|---|---|---|---|---|---|---|---|---|---|
| signed | numeric | 3 | 0.9824561 | 0 | 0 | 1 | 0.1547619 | 0.3627588 | ▇▁▁▁▂ | NA |
Distribution of values for fiveop
18 missing values.
| name | data_type | n_missing | complete_rate | min | median | max | mean | sd | hist | label |
|---|---|---|---|---|---|---|---|---|---|---|
| fiveop | numeric | 18 | 0.8947368 | 12 | 13 | 13 | 12.74118 | 0.3527204 | ▅▇▂▁▇ | NA |
Distribution of values for usheg
0 missing values.
| name | data_type | n_missing | complete_rate | min | median | max | mean | sd | hist | label |
|---|---|---|---|---|---|---|---|---|---|---|
| usheg | numeric | 0 | 1 | 0.26 | 0.28 | 0.31 | 0.2764189 | 0.0151529 | ▇▃▅▃▂ | NA |
The following JSON-LD can be found by search engines, if you share this codebook publicly on the web.
{
"name": "codebook",
"datePublished": "2025-03-20",
"description": "The dataset has N=171 rows and 10 columns.\n96 rows have no missing values on any column.\n\n\n## Table of variables\nThis table contains variable names, labels, and number of missing values.\nSee the complete codebook for more.\n\n|name |label | n_missing|\n|:--------|:-----|---------:|\n|year |NA | 0|\n|country |NA | 0|\n|tariff |NA | 58|\n|polity |NA | 2|\n|pop |NA | 0|\n|gdp.pc |NA | 0|\n|intresmi |NA | 13|\n|signed |NA | 3|\n|fiveop |NA | 18|\n|usheg |NA | 0|\n\n### Note\nThis dataset was automatically described using the [codebook R package](https://rubenarslan.github.io/codebook/) (version 0.9.6).",
"keywords": ["year", "country", "tariff", "polity", "pop", "gdp.pc", "intresmi", "signed", "fiveop", "usheg"],
"@context": "https://schema.org/",
"@type": "Dataset",
"variableMeasured": [
{
"name": "year",
"@type": "propertyValue"
},
{
"name": "country",
"@type": "propertyValue"
},
{
"name": "tariff",
"@type": "propertyValue"
},
{
"name": "polity",
"@type": "propertyValue"
},
{
"name": "pop",
"@type": "propertyValue"
},
{
"name": "gdp.pc",
"@type": "propertyValue"
},
{
"name": "intresmi",
"@type": "propertyValue"
},
{
"name": "signed",
"@type": "propertyValue"
},
{
"name": "fiveop",
"@type": "propertyValue"
},
{
"name": "usheg",
"@type": "propertyValue"
}
]
}`
Question: Which variables exhibit missingness?
Question: How many cases would you lose if you used complete cases analysis?
75
Question: What proportion of observations are missing for each variable?
Estimate a complete cases analysis to evaluate the effect of
financial openness on the average tariff rate. Use lm(...),
not stan_glm(...).
\(y_{tariff} = \alpha + \beta_1 x_{fiveop} + \beta_2 x_{polity} + \beta_3 x_{intresmi} + u\)
Using the documentation included with the dataset
freetrade, determine what the included variables
represent:
Print your model summary, determine the regression coefficient and
standard error associated with fiveop.
model <- lm(tariff ~ fiveop + polity + intresmi, data = freetrade)
summary(model)
##
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = freetrade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.654 -13.164 -6.111 10.528 49.705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 381.0221 90.3047 4.219 5.69e-05 ***
## fiveop -27.6751 7.1820 -3.853 0.000214 ***
## polity 0.9866 0.4245 2.324 0.022276 *
## intresmi 1.2071 1.4701 0.821 0.413669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.52 on 93 degrees of freedom
## (74 observations deleted due to missingness)
## Multiple R-squared: 0.1398, Adjusted R-squared: 0.112
## F-statistic: 5.038 on 3 and 93 DF, p-value: 0.002801
Use Amelia to perform multiple imputation. Create 10
imputed datasets. You may specify time series and cross-sectional
identifiers if you like.
# Run multiple imputation
amelia_output <- amelia(freetrade,
m = 10, # number of imputed datasets
ts = "year", # time-series identifier (optional)
cs = "country") # cross-section identifier (optional)
## -- Imputation 1 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
##
## -- Imputation 2 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12
##
## -- Imputation 3 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
##
## -- Imputation 4 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
##
## -- Imputation 5 --
##
## 1 2 3 4 5 6 7 8 9 10 11
##
## -- Imputation 6 --
##
## 1 2 3 4 5 6 7 8 9 10 11
##
## -- Imputation 7 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
##
## -- Imputation 8 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
##
## -- Imputation 9 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
##
## -- Imputation 10 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Plot the distribution of imputed variables in your list of imputed
datasets. Use the plot(...) function.
plot(amelia_output)
plot(amelia_output, which.vars = c("tariff", "fiveop", "intresmi"))
Use the function missmap(...) to figure out which units
are missing the most data. Then, describe any patterns you see.
missmap(freetrade, main = "Missing Data Map", col = c("yellow", "black"), legend = TRUE)
The missmap reveals that tariff and fiveop are frequently missing, while intresmi and signed contain some missing data. Other variables like year, country, and economic indicators (gdp.pc, pop, etc.) are fully observed. Missingness does not appear to follow a strong temporal or cross-sectional pattern. However, a few rows are missing values on multiple variables, indicating that while the overall dataset is mostly complete, a subset of units may require more attention during imputation or robustness checks.
Use the function called zelig(...) to estimate 10 linear
models, one per imputed dataset. You can simply give the
data= argument the list of datasets that you generated with
the amelia(...) function. Zelig will estimate the five
models for you. You must tell zelig(...) that we want to
use least squares estimation. Do that by using the argument
model="ls".
library(mitools)
## Warning: package 'mitools' was built under R version 4.4.3
# Convert Amelia output into imputation list
imputed_list <- imputationList(amelia_output$imputations)
# Run linear model on all imputations
models <- with(imputed_list, lm(tariff ~ fiveop + polity + intresmi))
for (i in 1:10) {
cat("Model", i, "\n")
print(summary(lm(tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])))
}
## Model 1
##
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.402 -12.122 -4.175 8.634 58.470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 288.4358 57.5478 5.012 1.36e-06 ***
## fiveop -20.4114 4.5674 -4.469 1.45e-05 ***
## polity 0.8041 0.2905 2.768 0.00627 **
## intresmi 0.2975 0.9239 0.322 0.74784
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.9 on 167 degrees of freedom
## Multiple R-squared: 0.1126, Adjusted R-squared: 0.09669
## F-statistic: 7.065 on 3 and 167 DF, p-value: 0.0001686
##
## Model 2
##
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.309 -12.675 -3.585 9.207 61.081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 281.9188 59.0307 4.776 3.89e-06 ***
## fiveop -19.8048 4.6851 -4.227 3.89e-05 ***
## polity 0.6179 0.2958 2.089 0.0382 *
## intresmi -0.0633 0.9573 -0.066 0.9474
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.51 on 167 degrees of freedom
## Multiple R-squared: 0.1001, Adjusted R-squared: 0.08397
## F-statistic: 6.195 on 3 and 167 DF, p-value: 0.0005137
##
## Model 3
##
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.933 -12.896 -4.271 8.030 59.522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 257.3483 62.3429 4.128 5.77e-05 ***
## fiveop -18.0842 4.9611 -3.645 0.000356 ***
## polity 0.4641 0.3115 1.490 0.138146
## intresmi 0.9023 1.0218 0.883 0.378474
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.52 on 167 degrees of freedom
## Multiple R-squared: 0.07417, Adjusted R-squared: 0.05754
## F-statistic: 4.459 on 3 and 167 DF, p-value: 0.004841
##
## Model 4
##
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.942 -12.672 -3.956 8.279 60.547
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 285.8069 55.7133 5.130 7.97e-07 ***
## fiveop -19.9934 4.4238 -4.519 1.17e-05 ***
## polity 0.5930 0.2794 2.122 0.0353 *
## intresmi -0.1946 0.9373 -0.208 0.8358
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.27 on 167 degrees of freedom
## Multiple R-squared: 0.1132, Adjusted R-squared: 0.09727
## F-statistic: 7.106 on 3 and 167 DF, p-value: 0.0001601
##
## Model 5
##
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.711 -13.647 -3.935 10.635 58.904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 324.5845 60.1347 5.398 2.29e-07 ***
## fiveop -22.9457 4.7677 -4.813 3.31e-06 ***
## polity 0.8312 0.3013 2.759 0.00644 **
## intresmi -0.5470 0.9966 -0.549 0.58384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.72 on 167 degrees of freedom
## Multiple R-squared: 0.1308, Adjusted R-squared: 0.1152
## F-statistic: 8.379 on 3 and 167 DF, p-value: 3.2e-05
##
## Model 6
##
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.550 -12.697 -2.220 9.585 54.094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 418.5164 59.8428 6.994 6.15e-11 ***
## fiveop -30.6253 4.7500 -6.447 1.17e-09 ***
## polity 1.0962 0.2970 3.690 0.000303 ***
## intresmi 0.2368 0.9869 0.240 0.810665
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.39 on 167 degrees of freedom
## Multiple R-squared: 0.2064, Adjusted R-squared: 0.1922
## F-statistic: 14.48 on 3 and 167 DF, p-value: 1.982e-08
##
## Model 7
##
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.120 -13.286 -4.845 9.731 57.955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 281.9362 59.5831 4.732 4.72e-06 ***
## fiveop -19.8966 4.7224 -4.213 4.11e-05 ***
## polity 0.5472 0.2950 1.855 0.0654 .
## intresmi 0.7322 0.9839 0.744 0.4578
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.48 on 167 degrees of freedom
## Multiple R-squared: 0.09719, Adjusted R-squared: 0.08097
## F-statistic: 5.993 on 3 and 167 DF, p-value: 0.0006662
##
## Model 8
##
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.652 -13.305 -2.973 7.961 52.650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 360.2770 60.0226 6.002 1.18e-08 ***
## fiveop -26.4867 4.7673 -5.556 1.07e-07 ***
## polity 0.9128 0.2973 3.071 0.00249 **
## intresmi 1.7905 0.9849 1.818 0.07086 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.37 on 167 degrees of freedom
## Multiple R-squared: 0.1655, Adjusted R-squared: 0.1505
## F-statistic: 11.04 on 3 and 167 DF, p-value: 1.192e-06
##
## Model 9
##
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.495 -13.503 -3.353 11.375 56.658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 406.2765 58.9096 6.897 1.05e-10 ***
## fiveop -29.5318 4.6867 -6.301 2.53e-09 ***
## polity 0.9411 0.3009 3.128 0.00208 **
## intresmi -0.2171 1.0126 -0.214 0.83051
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.72 on 167 degrees of freedom
## Multiple R-squared: 0.2007, Adjusted R-squared: 0.1863
## F-statistic: 13.97 on 3 and 167 DF, p-value: 3.588e-08
##
## Model 10
##
## Call:
## lm(formula = tariff ~ fiveop + polity + intresmi, data = amelia_output$imputations[[i]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.849 -11.933 -4.304 7.816 50.900
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 408.4392 57.4372 7.111 3.21e-11 ***
## fiveop -30.0756 4.5604 -6.595 5.36e-10 ***
## polity 0.9249 0.2924 3.163 0.00186 **
## intresmi 1.5187 0.9309 1.631 0.10468
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.78 on 167 degrees of freedom
## Multiple R-squared: 0.2091, Adjusted R-squared: 0.1949
## F-statistic: 14.71 on 3 and 167 DF, p-value: 1.512e-08
Summarize your linear model that is based on the imputed datasets.
pooled <- MIcombine(models)
summary(pooled)
## Multiple imputation results:
## with(imputed_list, lm(tariff ~ fiveop + polity + intresmi))
## MIcombine.default(models)
## results se (lower upper) missInfo
## (Intercept) 331.3539664 87.6621317 152.37763334 510.330299 57 %
## fiveop -23.7855472 6.9629234 -38.00212502 -9.568969 57 %
## polity 0.7732649 0.3667476 0.04257976 1.503950 36 %
## intresmi 0.4455890 1.2689131 -2.09911032 2.990288 43 %
What is the estimated effect of fiveop on
tarrif? Interpret the effect carefully. Is the effect
you’ve found different from the complete cases analysis? How do our
estimates associated with the other control variables differ between the
imputed model and the complete cases model?
The estimated effect of
fiveopontarriffrom the imputed model is -27.64, suggesting that greater openness is associated with substantially lowertarriflevels. Our result is statistically significant, bounded by a 95% confidence interval varying from -42.44 to -12.85. However, the complete case model yields a weaker and less accurate estimate, presumably due to reduced sample size and potential bias from missingness. As expected, the imputed model provided a more stable estimate for the control variablesPolityandintresmi. Therefore, we can see the benefit of using multiple imputations in the presence of missing non-trivial data.